$$\frac{\partial L_{Adv_p}}{\partial C_p^l} = -\sum_i 2\left(1 - D_p(T_{p,i}^l; Y_p)\right)\frac{\partial D_p}{\partial C_p^l}. \tag{3.95}$$
Furthermore,
$$\frac{\partial L_{Data_p}}{\partial C_p^l} = \frac{1}{n}\sum_i \left(R_p - T_p\right)\frac{\partial T_p}{\partial C_p^l}. \tag{3.96}$$
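In practice these gradients are produced by automatic differentiation, but their summation structure can be made explicit. The NumPy sketch below evaluates Eqs. (3.95) and (3.96) for toy per-sample discriminator outputs and pre-computed Jacobians; the array names, shapes, and random stand-in values are illustrative assumptions, not part of the original formulation.

```python
# Toy NumPy evaluation of Eqs. (3.95)-(3.96). D_out, the Jacobians dD_dC and
# dT_dC, and the feature maps R, T are random stand-ins; only the summation
# structure mirrors the equations.
import numpy as np

def grad_adv_wrt_C(D_out, dD_dC):
    """Eq. (3.95): -sum_i 2 * (1 - D_p(T^l_{p,i}; Y_p)) * dD_p/dC^l_p."""
    return -np.sum(2.0 * (1.0 - D_out)[:, None] * dD_dC, axis=0)

def grad_data_wrt_C(R, T, dT_dC, n):
    """Eq. (3.96): (1/n) * sum_i (R_p - T_p) * dT_p/dC^l_p."""
    return np.sum((R - T)[:, None] * dT_dC, axis=0) / n

rng = np.random.default_rng(0)
n, c = 4, 3                          # 4 samples, 3 entries in C^l_p
D_out = rng.uniform(size=n)          # discriminator outputs on T^l_{p,i}
dD_dC = rng.normal(size=(n, c))      # per-sample rows of dD_p/dC^l_p
R, T = rng.normal(size=n), rng.normal(size=n)
dT_dC = rng.normal(size=(n, c))      # per-sample rows of dT_p/dC^l_p
print(grad_adv_wrt_C(D_out, dD_dC))
print(grad_data_wrt_C(R, T, dT_dC, n))
```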
The complete training process is summarized in Algorithm 4, including the update of the
discriminators.
Algorithm 4 Pruned RBCN
Input: The training dataset, the pre-trained 1-bit CNN model, the feature maps R_p from the pre-trained model, the pruning rate, and the hyper-parameters, including the initial learning rate, weight decay, convolution stride, and padding size.
Output: The pruned RBCN with updated parameters W_p, Ŵ_p, M_p, and C_p.
1: repeat
2:   Randomly sample a mini-batch;
3:   // Forward propagation
4:   Train the pruned architecture // Using Eqs. 17–22
5:   for l = 1 to L convolutional layers do
6:     F^l_{out,p} = Conv(F^l_{in,p}, (Ŵ^l_p ◦ M_p) ⊙ C^l_p);
7:   end for
8:   // Backward propagation
9:   for l = L to 1 do
10:    Update the discriminators D^l_p(·) by ascending their stochastic gradients:
11:    ∇_{D^l_p}( log(D^l_p(R^l_p; Y_p)) + log(1 − D^l_p(T^l_p; Y_p)) + log(D^l_p(T_p; Y_p)) );
12:    Update the soft mask M_p by FISTA // Using Eqs. 24–26
13:    Calculate the gradient δW^l_p; // Using Eqs. 27–31
14:    W^l_p ← W^l_p − η_{p,1} δW^l_p; // Update the weights
15:    Calculate the gradient δC^l_p; // Using Eqs. 32–36
16:    C^l_p ← C^l_p − η_{p,2} δC^l_p; // Update the learnable matrices
17:  end for
18: until the maximum epoch
19: Ŵ = sign(W).
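A compact PyTorch-style rendering of this loop is sketched below. It is not the authors' implementation: the helpers weight_params(), matrix_params(), features_and_logits(), and fista_update() are hypothetical stand-ins for the W_p/C_p parameter groups, the layer-wise feature maps T^l_p, and the FISTA soft-mask step of Eqs. 24–26; conditioning the discriminators on Y_p is omitted for brevity.

```python
# A minimal sketch of the Algorithm 4 loop, under the assumptions named above.
# `teacher` is the pre-trained 1-bit model providing R^l_p; `generator` is the
# pruned RBCN whose layers compute Conv(F_in, (W_hat o M) * C).
import torch
import torch.nn.functional as F

def train_pruned_rbcn(generator, discriminators, teacher, loader,
                      max_epoch, lr_w=1e-3, lr_c=1e-3, lr_d=1e-4):
    opt_w = torch.optim.SGD(generator.weight_params(), lr=lr_w)   # W_p (step 14)
    opt_c = torch.optim.SGD(generator.matrix_params(), lr=lr_c)   # C_p (step 16)
    opt_d = torch.optim.Adam(
        [p for d in discriminators for p in d.parameters()], lr=lr_d)

    for epoch in range(max_epoch):
        for x, _ in loader:                        # step 2: sample a mini-batch
            with torch.no_grad():
                feats_r = teacher.features(x)      # R^l_p from the pre-trained model
            feats_t, _ = generator.features_and_logits(x)   # T^l_p (steps 5-7)

            # Steps 10-11: ascend the discriminators' stochastic gradients
            # (implemented here as descending the negated objective).
            d_loss = 0.0
            for rl, tl, d in zip(feats_r, feats_t, discriminators):
                d_loss = d_loss - (torch.log(d(rl)).mean()
                                   + torch.log(1.0 - d(tl.detach())).mean())
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()

            # Step 12: soft-mask update by FISTA (Eqs. 24-26, left abstract here).
            generator.fista_update()

            # Steps 13-16: adversarial + data losses drive W_p and C_p.
            g_loss = 0.0
            for tl, d in zip(feats_t, discriminators):
                g_loss = g_loss + ((1.0 - d(tl)) ** 2).mean()        # L_Adv term
            g_loss = g_loss + F.mse_loss(feats_t[-1], feats_r[-1])   # L_Data (final maps)
            opt_w.zero_grad(); opt_c.zero_grad()
            g_loss.backward()
            opt_w.step(); opt_c.step()
    return generator
```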
3.6.4 Ablation Study
This section studies the performance contributions of the kernel approximation, the GAN,
and the update strategy (we fix the parameters of the convolutional layers and update the
other layers). CIFAR100 and ResNet18 with different kernel stages are used.
1) We replace the convolution in Bi-Real Net with our kernel approximation (RBConv) and compare the results; a minimal RBConv sketch is given below, after point 2. As shown in the columns "Bi" and "R" of Table 3.3, RBCN achieves a 1.62% accuracy improvement over Bi-Real Net (56.54% vs. 54.92%) with the same ResNet18 network structure. This significant improvement verifies the effectiveness of the learnable matrices.
2) Using the GAN improves RBCN by 2.59% (59.13% vs. 56.54%) with the kernel stage 32-32-64-128, which shows that the GAN helps mitigate the problem of being trapped in poor local minima.
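To make the RBConv replacement in point 1 concrete, the sketch below binarizes the weights with a straight-through sign function and modulates them with a learnable matrix before the convolution, i.e. the (Ŵ ◦ M) ⊙ C form of Algorithm 4 with the soft mask dropped. The channel-wise shape chosen for C and all layer hyper-parameters are assumptions for illustration, not the exact configuration used in the experiments.

```python
# A minimal sketch of the RBConv idea: sign-binarized weights scaled by a
# learnable matrix C (here one scalar per output channel, an assumption).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignSTE(torch.autograd.Function):
    """Binarize in the forward pass; pass gradients through for |w| <= 1."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1).to(grad_out.dtype)

class RBConv(nn.Module):
    def __init__(self, c_in, c_out, k=3, stride=1, padding=1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(c_out, c_in, k, k) * 0.01)
        self.C = nn.Parameter(torch.ones(c_out, 1, 1, 1))   # learnable matrix C^l
        self.stride, self.padding = stride, padding

    def forward(self, x):
        w_hat = SignSTE.apply(self.weight)                  # binarized weights
        return F.conv2d(x, w_hat * self.C,                  # modulate W_hat by C
                        stride=self.stride, padding=self.padding)

# usage: y = RBConv(64, 128, stride=2)(torch.randn(1, 64, 32, 32))
```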